Identifying owners of provenance documents from their provenance network metrics.
In this notebook, we compare the classification accuracy on the unbalanced (original) ProvStore dataset with that on a balanced version of the same dataset.
In [1]:
import pandas as pd
In [2]:
df = pd.read_csv("provstore/data.csv")
df.head()
Out[2]:
In [3]:
df.describe()
Out[3]:
In [4]:
# The number of each label in the dataset
df.label.value_counts()
Out[4]:
In [5]:
from analytics import test_classification
Cross Validation tests: We now run the cross validation tests on the dataset (df) using all the features (combined), only the generic network metrics (generic), and only the provenance-specific network metrics (provenance). Please refer to Cross Validation Code.ipynb for a detailed description of the cross validation code.
In [6]:
results, importances = test_classification(df)
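The test_classification helper above is defined in the accompanying analytics module, whose internals are documented in Cross Validation Code.ipynb. As a rough illustration only, a stratified cross-validation test of this kind might be sketched as follows (the choice of random forest classifier, the fold count, and the function name are assumptions, not the notebook's actual implementation):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

def cross_validate_sketch(df, feature_cols, label_col="label"):
    """Hypothetical sketch: score a classifier on the given feature set
    using stratified k-fold cross validation."""
    X = df[feature_cols]
    y = df[label_col]
    # Random forest chosen here purely for illustration
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    # Stratified folds keep each label's proportion roughly constant
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    scores = cross_val_score(clf, X, y, cv=cv)
    return scores.mean(), scores.std()
```

Each of the three feature sets (combined, generic, provenance) would be passed as feature_cols in turn to obtain the per-set accuracies compared below.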
In [7]:
from analytics import balance_smote
Balancing the data
With an unbalanced dataset like the one above, the resulting trained classifier will typically be skewed towards the majority labels. To mitigate this, we balance the dataset using the SMOTE Oversampling Method.
In [8]:
df = balance_smote(df)
In [9]:
results_bal, importances_bal = test_classification(df)
Result: The classifiers achieve higher performance on the balanced data when provenance-specific metrics are used (with either the combined or the provenance metrics set). The classifiers trained on the generic metrics set, however, perform better on the original, unbalanced data. A possible explanation is that some of the minority labels have more distinctive provenance-specific metrics than generic ones; when more such samples are introduced by the balancing process, the generic metrics alone cannot identify those samples as well, hence the lower accuracy.